Machine Learning 2

Exercise 1

Logistic regression and gradient descent

1 - Lecture notes exercise 1

Solve exercise 1 in the lecture notes.

Solution:

link to pdf


In order to apply logistic regression we need to know how to optimize functions, in our case the logistic regression loss (3.11) in the lecture notes. If you already have experience with optimization, you may not need the following two assignments.

  • Thanks to David for tasks 2 and 3.

2 - Calculate some gradients

a) Calculate the gradients of the following functions

$$f(x, y) = \frac{1}{x^2+y^2}$$

and $$f(x, y) = x^2y.$$

b) A standard way to find a minimum computationally is gradient descent.
Start at some (possibly random) point $ \overrightarrow{p}=(x,y)^T $ and move downwards, i.e. in the direction of the negative gradient. The step size $\lambda$ has to be controlled or chosen small enough. When a loss function is optimized in a machine learning context, $\lambda$ is also called the learning rate.

The update equation

$$ \overrightarrow{p}_{i+1} = \overrightarrow{p}_{i} - \lambda \cdot \nabla f(\overrightarrow{p}_{i})$$

is then iterated until the norm of the gradient is below some threshold.
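
As an illustration of this loop (not part of the exercise), here is a minimal sketch in Python; the function name gradient_descent and the quadratic test function are our own choices:

import numpy as np

def gradient_descent(grad_f, p0, lam=0.1, tol=1e-8, max_iter=10000):
  # Iterate p <- p - lam * grad_f(p) until the gradient norm falls below tol.
  p = np.asarray(p0, dtype=float)
  for _ in range(max_iter):
    g = grad_f(p)
    if np.linalg.norm(g) < tol:
      break
    p = p - lam * g
  return p

# Test on f(x, y) = (x - 1)^2 + (y + 2)^2, whose minimum is at (1, -2).
grad = lambda p: 2 * (p - np.array([1.0, -2.0]))
print(gradient_descent(grad, p0=[5.0, 5.0]))  # approximately [ 1. -2.]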

Write down the update equations for the two functions in a)!

Solution:

a) For $f(x, y) = \frac{1}{x^2+y^2}$ we obtain

$$\nabla f(x, y) = \begin{pmatrix}-\frac{2x}{(x^2+y^2)^2} \\ -\frac{2y}{(x^2+y^2)^2}\end{pmatrix}.$$

$f(x, y) = x^2y$ results in

$$\nabla f(x,y) = \begin{pmatrix}2xy \\ x^2\end{pmatrix}$$

b) The update equations for $f(x, y) = \frac{1}{x^2+y^2}$ are $$x \leftarrow x + \lambda \frac{2x}{(x^2+y^2)^2}$$ $$y \leftarrow y + \lambda \frac{2y}{(x^2+y^2)^2}$$

The update equations for $f(x, y) = x^2y$ are

$$x \leftarrow x - 2\lambda xy$$

$$y \leftarrow y - \lambda x^2$$
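
These gradients (and hence the update equations) can be sanity-checked numerically, for example with central finite differences. A minimal sketch, where the helper num_grad and the test point are our own choices:

import numpy as np

def num_grad(f, x, y, h=1e-6):
  # Central finite-difference approximation of the gradient of f at (x, y).
  return np.array([(f(x + h, y) - f(x - h, y)) / (2 * h),
                   (f(x, y + h) - f(x, y - h)) / (2 * h)])

f1 = lambda x, y: 1.0 / (x**2 + y**2)
f2 = lambda x, y: x**2 * y

x, y = 1.3, -0.7
print(num_grad(f1, x, y), np.array([-2*x, -2*y]) / (x**2 + y**2)**2)  # should agree
print(num_grad(f2, x, y), np.array([2*x*y, x**2]))                    # should agree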

3 - Visualization of gradient descent

For this task we use the double well potential

$$V(x) = ax^4 + bx^2 + cx + d$$

with $a = 1$, $b = -3$, $c =1$ and $d = 3.514$.

We seek to find the global minimum $x_{min}$ of this function with gradient descent. (In 1D the gradient is just the derivative.)

a) Calculate the derivative of $V(x)$ and the update equation for $x$ with learning rate $\lambda$.

b) Complete the code below.

c) Test the following starting points and learning rates $\lambda$:

$$(x_0, \lambda) = (-1.75, 0.001)$$

$$(x_0, \lambda) = (-1.75, 0.19)$$

$$(x_0, \lambda) = (-1.75, 0.1)$$

$$(x_0, \lambda) = (-1.75, 0.205)$$

d) How could one find a compromise between $(x_0, \lambda) = (-1.75, 0.001)$ and $(x_0, \lambda) = (-1.75, 0.19)$?

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def update2(x, a, b, c, d, lam):
  # One gradient descent step: x <- x - lam * V'(x), with V'(x) = 4ax^3 + 2bx + c.
  x = x - lam*(4*a*x**3 + 2*b*x + c)
  return x

def V(x, a, b, c, d):
  # Double well potential V(x) = ax^4 + bx^2 + cx + d.
  return a*x**4 + b*x**2 + c*x + d

# Parameters of the double well potential, set to the values given in the exercise.
a = 1
b = -3
c = 1
d = 3.514

x0 = -1.75
iterations = 101
lams = np.array([0.001, 0.19, 0.1, 0.205])

losses = np.empty(shape=(iterations, len(lams)))
results = np.empty(len(lams))

for j in range(len(lams)):
  x = x0
  lam = lams[j]
  for i in range(iterations):
    losses[i, j] = V(x, a, b, c, d)
    if i != iterations - 1:
      x = update2(x, a, b, c, d, lam)
  results[j] = x

for j in range(len(lams)):
  print(100*"-")
  print("lambdas: ", lams[j])
  print("xmin: ", results[j])
  print("Loss: ", V(results[j], a, b, c, d))

colors = {
    0.001: "blue",
    0.19: "red",
    0.1: "black",
    0.205: "orange"
}

plt.figure(figsize=(8, 8))
plt.title("Learning curves")
plt.xlabel("Epoch")
plt.ylabel("Loss V")
plt.xlim(0, iterations)

for i in range(len(lams)):
  lam = lams[i]
  plt.plot(range(iterations), losses[:, i], label=str(lam), color=colors[lam])

plt.legend()
plt.ylim(bottom=0)
plt.show()

plt.figure(figsize=(8, 8))
plt.title("Funktion V und Minima")
plt.xlabel("x")
plt.ylabel("V(x)")

xs = np.linspace(-2, 2, 100)
ys = V(xs, a, b, c, d)

plt.plot(xs, ys)

for j in range(len(lams)):
  lam = lams[j]
  xmin = results[j]
  vxmin = V(xmin, a, b, c, d)
  plt.plot(xmin, vxmin, marker='.', linestyle="None", label=str(lam), color=colors[lam], ms=10)
plt.legend()
plt.show()
----------------------------------------------------------------------------------------------------
lambdas:  0.001
xmin:  -1.376518421356889
Loss:  0.04335095929070443
----------------------------------------------------------------------------------------------------
lambdas:  0.19
xmin:  -0.8912145169689081
Loss:  0.8708497512243683
----------------------------------------------------------------------------------------------------
lambdas:  0.1
xmin:  -1.300839565941577
Loss:  9.496106521122982e-05
----------------------------------------------------------------------------------------------------
lambdas:  0.205
xmin:  1.1308872611718561
Loss:  2.443769819121824

Solution:

a) The derivative is

$$\partial_x V(x) = 4ax^3 + 2bx + c.$$

The update equation is therefore

$$x \leftarrow x - \lambda \left(4ax^3 + 2bx + c\right)$$
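
The derivative can also be double-checked symbolically, e.g. with SymPy (assuming it is installed):

import sympy as sp

x, a, b, c, d = sp.symbols("x a b c d")
print(sp.diff(a*x**4 + b*x**2 + c*x + d, x))  # 4*a*x**3 + 2*b*x + c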

c)

$(x_0, \lambda) = (-1.75, 0.001)$: The left (global) minimum is approached, but only very slowly ($\lambda$ too small).

$(x_0, \lambda) = (-1.75, 0.19)$: No minimum is found; $x$ jumps around inside the left valley ($\lambda$ too big).

$(x_0, \lambda) = (-1.75, 0.1)$: The left (global) minimum is found.

$(x_0, \lambda) = (-1.75, 0.205)$: The iteration jumps over the left minimum; the local minimum on the right is found instead.

d) Adjust $\lambda$ during the run: start with a larger $\lambda$ and reduce it by some factor $f$, e.g. every $n$ steps:

$$\lambda \leftarrow f\cdot\lambda \quad \text{every } n \text{ epochs.}$$

Even better: monitor the reduction of the loss, reduce $\lambda$ when necessary and increase it when possible.
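
A minimal sketch of such a step decay, reusing V and the parameters a, b, c, d from the code cell above; the schedule parameters n = 20 and f = 0.5 are arbitrary choices, not tuned values:

x, lam = -1.75, 0.19   # start with the "too big" learning rate
n, f = 20, 0.5         # every n steps, shrink lam by the factor f
for i in range(200):
  x = x - lam * (4*a*x**3 + 2*b*x + c)
  if (i + 1) % n == 0:
    lam = f * lam
print(x, V(x, a, b, c, d))  # should end up close to the left (global) minimum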

4 - Logistic Regression

Consider two 1D normal distributions with $\sigma^2 = 1$, located at $\mu_1 = 0.0$ and $\mu_2 = 2.0$. Sample $N$ values from each of these distributions and assign the class labels "0" and "1" to the values ("0" for the values coming from the normal distribution centered at $0$). Let this be your labeled data. Learn a logistic regression model on these data. Choose $N = 5$ and $N = 100$.

At which location is the probability for your class label being "0" (and "1") exactly 50%?

Hints:

  • data1 = numpy.random.normal(mu1, sigma1, sizeN)
  • You can see from the question how to choose your linear model (3.1) in the lecture notes: there is a constant term $\theta_0$, i.e. (3.1) becomes $\langle\theta, \hat x\rangle = (\theta_1, \theta_0) \cdot (x,1)^T = \theta_1 \cdot x + \theta_0$.

Solution:

link to Python file
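
For reference, a minimal sketch of what such a script could look like; using sklearn.linear_model.LogisticRegression here is our own choice, any implementation of the loss (3.11) works as well:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N = 100                                        # also try N = 5
x = np.concatenate([rng.normal(0.0, 1.0, N),   # class "0" around mu1 = 0.0
                    rng.normal(2.0, 1.0, N)])  # class "1" around mu2 = 2.0
y = np.concatenate([np.zeros(N), np.ones(N)])

clf = LogisticRegression().fit(x.reshape(-1, 1), y)
theta1, theta0 = clf.coef_[0, 0], clf.intercept_[0]

# The 50% point is where theta1*x + theta0 = 0, i.e. where the sigmoid equals 1/2.
print("50% decision boundary at x =", -theta0 / theta1)  # close to 1.0 for large N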

5 - Logistic Regression scikit-learn example

Run and understand the example "MNIST classification using multinomial logistic regression" from scikit-learn.

In [ ]:
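A condensed sketch along the lines of that example (it downloads MNIST via OpenML, so it needs network access; the hyperparameters roughly follow the example but are not tuned here):

from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=5000, test_size=10000, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# With solver="saga" and more than two classes, the multinomial (softmax)
# formulation of logistic regression is used; the L1 penalty sparsifies the weights.
clf = LogisticRegression(penalty="l1", C=0.1, solver="saga", tol=0.1)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))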